Sentiment analysis is a central task in NLP: from predicting stock movements to understanding tweets, we use it to make sense of the text data in our lives. It is a crucial tool for exploring the language around us in a machine learning context; by looking at words and their respective semantics, we can analyze and decipher datasets far too large to ever read by hand. A business, for example, can use sentiment analysis to study brand awareness or trending opinions.
We took several approaches to the data: after first processing it and exploring it through charts and graphs, we looked into the relevant types of machine learning algorithms.
After cleaning up the text data (removing stray characters and stop words, and making the casing consistent), the text was ready for processing. We then added features, which we analyzed through charts and graphs. The dataset has three columns:
- Sentiment: a label of 'positive', 'negative', or 'neutral'
- Text: the tweets we are studying
- Unnamed: 2: appears superfluous, containing only one or two values
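Since 'Unnamed: 2' carries almost no information, one option is to drop it right after loading; a minimal sketch on a hypothetical two-row frame (the notebook itself keeps the column and filters later):

```python
import pandas as pd

# Hypothetical miniature of the dataset: 'Unnamed: 2' is almost entirely NaN
df = pd.DataFrame({
    "Text": ["Such an easy app to use", "All rides have been enjoyable."],
    "Sentiment": ["positive", "positive"],
    "Unnamed: 2": [float("nan"), float("nan")],
})

# errors='ignore' makes the drop safe to re-run once the column is gone
df = df.drop(columns=["Unnamed: 2"], errors="ignore")
print(list(df.columns))  # → ['Text', 'Sentiment']
```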
Our objective is to build a semi-supervised machine learning model that can categorize tweets by sentiment, trained on the dataset collected here.
Here we import relevant packages.
import re
import string
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import plotly.io as pio
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from textblob import TextBlob
from wordcloud import WordCloud

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

pio.templates.default = "plotly_white"
warnings.filterwarnings('ignore')
Colab sometimes fails to render plotly figures unless notebook mode is re-initialized in each cell. To avoid repeating that boilerplate, we define an enable_plotly_in_cell helper.
def enable_plotly_in_cell():
    import IPython
    from plotly.offline import init_notebook_mode
    display(IPython.core.display.HTML('''<script src="/static/components/requirejs/require.js"></script>'''))
    init_notebook_mode(connected=False)
df = pd.read_csv('dataset_semi.csv')
df
| | Text | Sentiment | Unnamed: 2 |
|---|---|---|---|
| 0 | Such an easy app to use, really quick and easy... | positive | NaN |
| 1 | The drivers and the services have been excepti... | positive | NaN |
| 2 | All rides have been enjoyable. | positive | NaN |
| 3 | Driver very knew where I was | neutral | NaN |
| 4 | most driver's are child friendly and patient. | positive | NaN |
| ... | ... | ... | ... |
| 5917 | My Liked Songs can only display all my songs i... | neutral | NaN |
| 5918 | Although it can be a little annoying in the fr... | negative | NaN |
| 5919 | It isn't about the catalogue..it's about the c... | positive | NaN |
| 5920 | Except for the fact that I can't open my downl... | negative | NaN |
| 5921 | This app stinks too many interruptions and upg... | negative | NaN |
5922 rows × 3 columns
Our data file has the Text and Sentiment columns plus an out-of-place column, 'Unnamed: 2'. The set of observations is complete, and the data is relatively clean.
Descriptive Analysis
df.shape
(5922, 3)
print(f'We see the dataset has {df.shape[0]} observations and {df.shape[1]} features.')
We see the dataset has 5922 observations and 3 features.
Review the head and a random sample of the data
df.head()
| | Text | Sentiment | Unnamed: 2 |
|---|---|---|---|
| 0 | Such an easy app to use, really quick and easy... | positive | NaN |
| 1 | The drivers and the services have been excepti... | positive | NaN |
| 2 | All rides have been enjoyable. | positive | NaN |
| 3 | Driver very knew where I was | neutral | NaN |
| 4 | most driver's are child friendly and patient. | positive | NaN |
df.sample(5)
| | Text | Sentiment | Unnamed: 2 |
|---|---|---|---|
| 1341 | Its a good app for online study and mettings . | positive | NaN |
| 1875 | An extraordinary accuracy via gps , a must if ... | positive | NaN |
| 4830 | This is a nice app but many more features are ... | positive | NaN |
| 3328 | Very well succeeded, i Love it! | positive | NaN |
| 2944 | No latency or poor connection s | negative | NaN |
We want to look at the datatypes and check to see if they were interpreted correctly.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5922 entries, 0 to 5921
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Text        5922 non-null   object
 1   Sentiment   5922 non-null   object
 2   Unnamed: 2  1 non-null      object
dtypes: object(3)
memory usage: 138.9+ KB
Summary statistics of the data
df.describe(include = 'all')
| | Text | Sentiment | Unnamed: 2 |
|---|---|---|---|
| count | 5922 | 5922 | 1 |
| unique | 5871 | 5 | 1 |
| top | My music is only a search away! | negative | positive |
| freq | 3 | 2186 | 1 |
The summary shows 5871 unique texts out of 5922, so duplicates do exist, and the Sentiment column has 5 unique values instead of the expected 3. We will check both and handle them as required.
Look for missing data
df.isnull().sum()
Text             0
Sentiment        0
Unnamed: 2    5921
dtype: int64
No null values in the Text and Sentiment columns.
Check for duplicated records
print(f'There are {df.duplicated().sum()} duplicated rows in the dataset.')
There are 43 duplicated rows in the dataset.
df[df['Text'].duplicated() == True]
| | Text | Sentiment | Unnamed: 2 |
|---|---|---|---|
| 151 | Poor sync. | negative | NaN |
| 951 | working on it | neutral | NaN |
| 1599 | I don't see any reason to install! | negative | NaN |
| 1877 | now after this update i can't dismiss ads.. | negative | NaN |
| 1884 | The add songs feature in playlists doesnt let ... | negative | NaN |
| 1889 | slightly disappointed in you need to buy premi... | negative | NaN |
| 1907 | Too many ads, same ads repeated over and over ... | negative | NaN |
| 1913 | The app have to pay to choose your song of yo... | neutral | NaN |
| 1930 | Bring back my memories with old song with good... | neutral | NaN |
| 1933 | Whenever I go to play a track my app just cras... | negative | NaN |
| 1996 | GPs fail to show properly, drivers are mistake... | negative | NaN |
| 2021 | I'm about to stop using it period! | negative | NaN |
| 2034 | very bad app | negative | NaN |
| 2043 | But the update on my iPhone has been hanging f... | negative | NaN |
| 2045 | Habs already restarted does not bring anything | negative | NaN |
| 2056 | Love the new app! The design is super, and it'... | positive | NaN |
| 2078 | Chat heads don't appear sometimes but over all... | negative | NaN |
| 2101 | An example of what is possible without meeting... | negative | NaN |
| 2353 | Wont change my review until your CS contact me... | neutral | NaN |
| 3546 | Please make this game compatible with iPhone 6... | neutral | NaN |
| 3554 | No respond from them yet on that they are doin... | neutral | NaN |
| 3677 | From the functioncni app, totalne stilted and ... | positive | NaN |
| 3834 | Photos are sent in horrendous quality, with ch... | negative | NaN |
| 3932 | Premium is nice because it's affordable. | positive | NaN |
| 4356 | How in the hell does this app not have dark mo... | neutral | NaN |
| 4364 | Facebook changed a lot through years. | neutral | NaN |
| 4374 | Which incest developer has made such nonsense,... | negative | NaN |
| 4383 | After this last update, when I put my e-mail a... | neutral | NaN |
| 4384 | Unfortunately, despite almost weekly updates o... | negative | NaN |
| 4427 | One more reason to consider when deciding why ... | neutral | NaN |
| 4430 | What will it take to remove this feature? | neutral | NaN |
| 4434 | I've been using the app for several years now. | neutral | NaN |
| 4528 | Help, unwrite me off the paid version! Please! | negative | NaN |
| 4910 | I watch all Yt, etc., only through here to not... | negative | NaN |
| 5323 | I am 19 and why can't i sing up | neutral | NaN |
| 5358 | My music is only a search away! | neutral | NaN |
| 5399 | Wish the app will allow you to add another rou... | positive | NaN |
| 5454 | Your estimated time of arrival is always not c... | negative | NaN |
| 5459 | I call him, message him, he doesn't pick up th... | negative | NaN |
| 5526 | My music is only a search away! | neutral | NaN |
| 5528 | My songs doesn't play automatically.I already ... | negative | NaN |
| 5529 | This used to be a wonderful App. | positive | NaN |
| 5530 | Lately it has been very slow and unresponsive. | negative | NaN |
| 5531 | It barely functions for me and many other user... | negative | NaN |
| 5532 | So unfortunately at this time I would not reco... | negative | NaN |
| 5533 | Very nice app enjoy it very much | positive | NaN |
| 5534 | Spotify has their own DC Universe access deals... | neutral | NaN |
| 5535 | It's official, I hate Spotify multiple people ... | negative | NaN |
| 5536 | Very disappointing to have drivers pick a requ... | negative | NaN |
| 5538 | I just luv their customer service | positive | NaN |
| 5581 | This update doesn t allow me to see my homepag... | negative | NaN |
Double-check that these rows are true duplicates and that the sentiment does not differ between copies
df.loc[df['Text'] == 'Poor sync.']
| | Text | Sentiment | Unnamed: 2 |
|---|---|---|---|
| 110 | Poor sync. | negative | NaN |
| 151 | Poor sync. | negative | NaN |
df.loc[df['Text'] == 'very bad app']
| | Text | Sentiment | Unnamed: 2 |
|---|---|---|---|
| 981 | very bad app | negative | NaN |
| 2034 | very bad app | negative | NaN |
Drop all duplicates. Note that keep=False removes every copy of a fully duplicated row rather than keeping one.
df.drop_duplicates(keep=False, inplace=True)
df
| | Text | Sentiment | Unnamed: 2 |
|---|---|---|---|
| 0 | Such an easy app to use, really quick and easy... | positive | NaN |
| 1 | The drivers and the services have been excepti... | positive | NaN |
| 2 | All rides have been enjoyable. | positive | NaN |
| 3 | Driver very knew where I was | neutral | NaN |
| 4 | most driver's are child friendly and patient. | positive | NaN |
| ... | ... | ... | ... |
| 5917 | My Liked Songs can only display all my songs i... | neutral | NaN |
| 5918 | Although it can be a little annoying in the fr... | negative | NaN |
| 5919 | It isn't about the catalogue..it's about the c... | positive | NaN |
| 5920 | Except for the fact that I can't open my downl... | negative | NaN |
| 5921 | This app stinks too many interruptions and upg... | negative | NaN |
5837 rows × 3 columns
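The keep=False behaviour used above is easy to miss; a toy frame makes the difference from the default keep='first' visible:

```python
import pandas as pd

toy = pd.DataFrame({"Text": ["a", "a", "b"],
                    "Sentiment": ["neg", "neg", "pos"]})

# keep='first' retains one copy of each duplicated row
print(len(toy.drop_duplicates(keep="first")))  # → 2

# keep=False removes every row that has a duplicate, including the first copy
print(len(toy.drop_duplicates(keep=False)))    # → 1
```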
Check duplicates again
df[df['Text'].duplicated() == True]
| | Text | Sentiment | Unnamed: 2 |
|---|---|---|---|
| 1599 | I don't see any reason to install! | negative | NaN |
| 2021 | I'm about to stop using it period! | negative | NaN |
| 2043 | But the update on my iPhone has been hanging f... | negative | NaN |
| 2045 | Habs already restarted does not bring anything | negative | NaN |
| 2101 | An example of what is possible without meeting... | negative | NaN |
| 4374 | Which incest developer has made such nonsense,... | negative | NaN |
| 4383 | After this last update, when I put my e-mail a... | neutral | NaN |
| 4427 | One more reason to consider when deciding why ... | neutral | NaN |
We still have duplicated texts. Let's review them.
df.loc[df['Text'] == 'But the update on my iPhone has been hanging for hours.']
| | Text | Sentiment | Unnamed: 2 |
|---|---|---|---|
| 1974 | But the update on my iPhone has been hanging f... | neutral | NaN |
| 2043 | But the update on my iPhone has been hanging f... | negative | NaN |
df.loc[df['Text'] == 'I\'m about to stop using it period!']
| | Text | Sentiment | Unnamed: 2 |
|---|---|---|---|
| 2010 | I'm about to stop using it period! | neutral | NaN |
| 2021 | I'm about to stop using it period! | negative | NaN |
For the remaining duplicates, the sentiment was captured differently: the same text appears with two different labels. We resolve this by assigning a single sentiment to each text.
df.loc[df['Text'] == 'I\'m about to stop using it period!', "Sentiment"] = "negative"
df.loc[df['Text'] == 'I don\'t see any reason to install!', "Sentiment"] = "negative"
df.loc[df['Text'] == 'But the update on my iPhone has been hanging for hours.', "Sentiment"] = "negative"
df.loc[df['Text'] == 'Habs already restarted does not bring anything', "Sentiment"] = "negative"
df.loc[df['Text'] == 'An example of what is possible without meeting on lots of lagg and curl ', "Sentiment"] = "negative"
df.loc[df['Text'] == 'Which incest developer has made such nonsense, heard executed or shared with the horses in 5 ', "Sentiment"] = "negative"
df.loc[df['Text'] == 'After this last update, when I put my e-mail and my password, from the error...', "Sentiment"] = "negative"
df.loc[df['Text'] == 'One more reason to consider when deciding why I should even keep this', "Sentiment"] = "negative"
df.drop_duplicates(keep=False, inplace=True)
df[df['Text'].duplicated() == True]
| | Text | Sentiment | Unnamed: 2 |
|---|---|---|---|
We have treated all the duplicates
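Rather than fixing each conflicting pair by hand, a more scalable alternative (a sketch, not what the notebook does) is a majority vote per text:

```python
import pandas as pd

# Toy frame where the same text carries two different labels
conflicts = pd.DataFrame({
    "Text": ["Poor sync.", "Poor sync.", "working on it"],
    "Sentiment": ["negative", "neutral", "neutral"],
})

# Majority label per text; mode() sorts ascending, so a tie resolves
# to the alphabetically first label ('negative' here)
majority = conflicts.groupby("Text")["Sentiment"].agg(lambda s: s.mode().iloc[0])
conflicts["Sentiment"] = conflicts["Text"].map(majority)
conflicts = conflicts.drop_duplicates().reset_index(drop=True)
print(conflicts.to_dict("records"))
```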
We build a list of all the reviews, then strip punctuation and digits so the casing and characters are consistent.
reviews = []
for index, row in df.iterrows():
    reviews.append(row['Text'])
Punctuation removal
punc = '''@1234567890!()-[]{};:'"\,<>./?@#$%^&*_~'''
reviewscleaned = []
for review in reviews:
    no_punct = ""
    for char in review:
        if char not in punc:
            no_punct = no_punct + char
    reviewscleaned.append(no_punct)
print(reviewscleaned)
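The character loop above can be collapsed into a single str.translate pass; shown here on toy strings (str.maketrans with a third argument maps each listed character to deletion):

```python
# Same character set as above; the translation table deletes each of them
punc = r'''@1234567890!()-[]{};:'"\,<>./?@#$%^&*_~'''
table = str.maketrans("", "", punc)

toy_reviews = ["It's 5 stars!", "Poor sync."]
toy_cleaned = [review.translate(table) for review in toy_reviews]
print(toy_cleaned)  # → ['Its  stars', 'Poor sync']
```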
Importing stop words. We found we needed to add Spanish stop words to the lexicon because many tweets weren't in English.
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
sw_nltk = stopwords.words('english')
sw_nltk1 = stopwords.words('spanish')
# Combine both lists; the expression (sw_nltk or sw_nltk1) would only
# evaluate to the English list, silently skipping the Spanish stop words
all_stopwords = set(sw_nltk) | set(sw_nltk1)
newreviews = []
for text in reviewscleaned:
    words = [word for word in text.split() if word.lower() not in all_stopwords]
    new_text = " ".join(words)
    newreviews.append(new_text)
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
Checking the cleaned reviews
print(newreviews)
We created a cleaned dataframe called df2
df2 = pd.DataFrame()
df2["Sentiment"] = df["Sentiment"]
df2["Text"] = newreviews
df2
| | Sentiment | Text |
|---|---|---|
| 0 | positive | easy app use really quick easy set absolutely ... |
| 1 | positive | drivers services exceptional since ever |
| 2 | positive | rides enjoyable |
| 3 | neutral | Driver knew |
| 4 | positive | drivers child friendly patient |
| ... | ... | ... |
| 5917 | neutral | Liked Songs display songs sort recently added |
| 5918 | negative | Although little annoying free version WAY bett... |
| 5919 | positive | isnt catalogueits curation Spotify |
| 5920 | negative | Except fact cant open downloaded albums Im Off... |
| 5921 | negative | app stinks many interruptions upgrades good do... |
5821 rows × 2 columns
sentiment = []
for index, row in df.iterrows():
    sentiment.append(row['Sentiment'])
We apply TF-IDF vectorization to the corpus, using the combined English and Spanish stop-word list
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
final_stopwords_list = stopwords.words('english') + stopwords.words('spanish')
input_vector = TfidfVectorizer(max_features=3000, min_df=6, max_df=0.8, stop_words=final_stopwords_list)
newreviews = input_vector.fit_transform(newreviews).toarray()
Here we find some odd values in the Sentiment column of the data
df2['Sentiment'].value_counts()
negative    2134
positive    2120
neutral     1565
it adds a lot of great options by opening doors to new places and experiences.    1
-    1
Name: Sentiment, dtype: int64
We see some stray values and non-ASCII characters in the Sentiment column. This is bad data; we will clean it up.
def remove_non_ascii(text):
    return re.sub(r'[^\x00-\x7F]', ' ', text)

df2['Sentiment'] = df2['Sentiment'].apply(remove_non_ascii)
Keeping only the rows with a valid sentiment label
df_filtered = df2.loc[df2['Sentiment'].isin(['positive', 'neutral', 'negative'])]
df_filtered['Sentiment'].value_counts()
negative    2134
positive    2120
neutral     1565
Name: Sentiment, dtype: int64
This is the dataframe used for most of the remaining work
df3 = df_filtered
df3.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5819 entries, 0 to 5921
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   Sentiment  5819 non-null   object
 1   Text       5819 non-null   object
dtypes: object(2)
memory usage: 136.4+ KB
df3['Sentiment'].isnull().sum()
0
Adding two features: character length and word length
# Note: lengths are computed from the original df['Text'] (pre-cleaning),
# so they reflect the raw review, not the stop-word-stripped text in df3
df3['Character_Length'] = df['Text'].str.len()
df3['Word_Length'] = df['Text'].str.count(' ') + 1
Checking value counts for character lengths
df3['Character_Length'].value_counts()
38 100
34 99
46 97
40 94
35 92
...
2123 1
378 1
476 1
469 1
278 1
Name: Character_Length, Length: 353, dtype: int64
df3['Word_Length'].value_counts()
6 591
7 499
8 461
9 394
11 363
...
100 1
68 1
149 1
101 1
71 1
Name: Word_Length, Length: 92, dtype: int64
df3.sample()
| | Sentiment | Text | Character_Length | Word_Length |
|---|---|---|---|---|
| 268 | neutral | Particularly widget hand home screen ability a... | 145 | 24 |
df3.head()
| | Sentiment | Text | Character_Length | Word_Length |
|---|---|---|---|---|
| 0 | positive | easy app use really quick easy set absolutely ... | 101 | 20 |
| 1 | positive | drivers services exceptional since ever | 61 | 10 |
| 2 | positive | rides enjoyable | 30 | 5 |
| 3 | neutral | Driver knew | 28 | 6 |
| 4 | positive | drivers child friendly patient | 45 | 7 |
A sentiment pie chart
enable_plotly_in_cell()
sentiment = df3['Sentiment'].value_counts()
fig = px.pie(sentiment,
values = sentiment.values,
names = sentiment.index,
color_discrete_sequence=px.colors.sequential.RdBu)
fig.update_traces(textinfo='percent+label',
marker = dict(line = dict(color = 'white', width = 5)))
fig.show()
WordCloud for top words
text = ' '.join(df3['Text'])
wordcloud = WordCloud(background_color="white").generate(text)
plt.figure(figsize=(15,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
def get_top_n_words(corpus, n=None):
    vec = CountVectorizer(stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]
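As a quick sanity check, the helper on a toy corpus (restated here so the snippet runs standalone):

```python
from sklearn.feature_extraction.text import CountVectorizer

def get_top_n_words(corpus, n=None):
    vec = CountVectorizer(stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    return sorted(words_freq, key=lambda x: x[1], reverse=True)[:n]

# 'good' and 'app' each appear twice; 'ride' and 'bad' once
top = get_top_n_words(["good app", "good ride", "bad app"], 2)
print({word: int(count) for word, count in top})  # → {'good': 2, 'app': 2}
```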
enable_plotly_in_cell()
common_words = get_top_n_words(df['Text'], 20)
df_unigram_20 = pd.DataFrame(common_words, columns = ['Word' , 'count']).sort_values(by="count",ascending=False).reset_index(drop=True)
fig = px.bar(df_unigram_20, x='Word', y='count')
fig.update_layout(
title={
'text': "Top 20 words across all cases",
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'})
fig.show()
def avg_word(sentence):
    words = sentence.split()
    if len(words) > 0:
        return sum(len(word) for word in words) / len(words)
    return 0

def avg_word_length(df):
    df['avg_word'] = df['Text'].apply(lambda x: avg_word(x))
    print(df[['Text','avg_word']].head())
avg_word_length(df3)
                                                Text  avg_word
0  easy app use really quick easy set absolutely ...      5.30
1            drivers services exceptional since ever      7.00
2                                    rides enjoyable      7.00
3                                        Driver knew      5.00
4                     drivers child friendly patient      6.75
df3
| | Sentiment | Text | Character_Length | Word_Length | avg_word |
|---|---|---|---|---|---|
| 0 | positive | easy app use really quick easy set absolutely ... | 101 | 20 | 5.300000 |
| 1 | positive | drivers services exceptional since ever | 61 | 10 | 7.000000 |
| 2 | positive | rides enjoyable | 30 | 5 | 7.000000 |
| 3 | neutral | Driver knew | 28 | 6 | 5.000000 |
| 4 | positive | drivers child friendly patient | 45 | 7 | 6.750000 |
| ... | ... | ... | ... | ... | ... |
| 5917 | neutral | Liked Songs display songs sort recently added | 75 | 15 | 5.571429 |
| 5918 | negative | Although little annoying free version WAY bett... | 87 | 17 | 6.125000 |
| 5919 | positive | isnt catalogueits curation Spotify | 72 | 11 | 7.750000 |
| 5920 | negative | Except fact cant open downloaded albums Im Off... | 84 | 16 | 5.222222 |
| 5921 | negative | app stinks many interruptions upgrades good do... | 111 | 17 | 6.230769 |
5819 rows × 5 columns
def hash_tags(df):
    df['hashtags'] = df['Text'].apply(lambda x: len([x for x in x.split() if x.startswith('#')]))
    print(df[['Text','hashtags']].head())
hash_tags(df3)
                                                Text  hashtags
0  easy app use really quick easy set absolutely ...         0
1            drivers services exceptional since ever         0
2                                    rides enjoyable         0
3                                        Driver knew         0
4                     drivers child friendly patient         0
Transform the cleaned text with the fitted TF-IDF vectorizer for modelling
newreviews2 = input_vector.transform(df3['Text']).toarray()
sentiment2 = []
for index, row in df3.iterrows():
    sentiment2.append(row['Sentiment'])
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(newreviews2,sentiment2,train_size=0.8)
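The split above is unseeded and unstratified, so results vary between runs. A reproducible, class-balanced variant looks like this (a sketch on toy data, not the notebook's arrays):

```python
from sklearn.model_selection import train_test_split

# Toy data: 6 'pos' and 4 'neg' labels
X = [[i] for i in range(10)]
y = ['pos'] * 6 + ['neg'] * 4

# stratify=y keeps the class ratio in both splits; random_state fixes the split
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.8, stratify=y, random_state=42)
print(sorted(y_te))  # → ['neg', 'pos']
```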
from sklearn.ensemble import RandomForestClassifier
rf_classifier = RandomForestClassifier(n_estimators=200, random_state=42)
rf_classifier.fit(X_train, y_train)
rf_classifier_score = rf_classifier.score(X_train, y_train)
rf_classifier_score
0.9935553168635876
from sklearn.svm import SVC
svc_classifier = SVC(kernel='linear')
svc_classifier.fit(X_train, y_train)
svc_classifier_score = svc_classifier.score(X_train, y_train)
svc_classifier_score
0.8977443609022556
import sklearn.linear_model as sk
lr_classifier = sk.LogisticRegression(random_state=0, solver='liblinear', multi_class='ovr').fit(X_train, y_train)
lr_classifier_score = lr_classifier.score(X_train, y_train)
lr_classifier_score
0.8850698174006445
from sklearn.metrics import accuracy_score
rf_test = rf_classifier.predict(X_test)
accuracy_scorerf = accuracy_score(y_test, rf_test)
print(accuracy_scorerf)
0.7551546391752577
svc_test = svc_classifier.predict(X_test)
accuracy_scoresvc = accuracy_score(y_test, svc_test)
print(accuracy_scoresvc)
0.7792096219931272
lr_test = lr_classifier.predict(X_test)
accuracy_scorelr = accuracy_score(y_test,lr_test)
print(accuracy_scorelr)
0.7783505154639175
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
lr_confusion_matrix = confusion_matrix(y_test, lr_test)
sns.heatmap(lr_confusion_matrix, annot=True, fmt='g');  # annot=True to annotate cells, fmt='g' to disable scientific notation
svc_confusion_matrix = confusion_matrix(y_test, svc_test)
rf_confusion_matrix = confusion_matrix(y_test, rf_test)
lr_test
array(['negative', 'positive', 'negative', ..., 'negative', 'negative',
'positive'], dtype='<U8')
Plot a bar graph of each model's training accuracy
enable_plotly_in_cell()
accuracy_of_models = {'SVC': svc_classifier_score,
'Random Forest': rf_classifier_score,
'Logistic Regression': lr_classifier_score}
fig = px.bar(x = list(accuracy_of_models.keys()), y= list(accuracy_of_models.values()),
color = list(accuracy_of_models.values()),
width = 800, height = 400,
color_discrete_sequence=px.colors.qualitative.G10,
labels={'x':'Classifier', 'y':'Accuracy'}, text_auto=True)
fig.update_layout(title='Training accuracy of classification models')
fig.show()
Compare confusion matrix for each model
confusion_matrix_of_models = {'SVC': svc_confusion_matrix,
'Random Forest': rf_confusion_matrix,
'Logistic Regression': lr_confusion_matrix}
enable_plotly_in_cell()
# Pick the best model by test accuracy
test_accuracy = {'Random Forest': accuracy_scorerf,
                 'SVC': accuracy_scoresvc,
                 'Logistic Regression': accuracy_scorelr}
best_model = max(test_accuracy, key=test_accuracy.get)
best_score = test_accuracy[best_model]
for key, matrix in confusion_matrix_of_models.items():
    # confusion_matrix(y_test, pred) puts actual labels on rows, predictions on columns
    fig = px.imshow(matrix, text_auto=True, aspect="auto",
                    color_continuous_scale='viridis',
                    labels=dict(x="Predicted", y="Actual"))
    fig.update_layout(title=f'{key} Matrix', height=500, width=800)
    fig.show()
best_score
0.7792096219931272
from IPython.display import Markdown
Markdown(f"""
#### From the results above we can see that {best_model} performs best with the highest accuracy of {round(best_score * 100, 2)}%""")
df3
| | Sentiment | Text | Character_Length | Word_Length | avg_word | hashtags |
|---|---|---|---|---|---|---|
| 0 | positive | easy app use really quick easy set absolutely ... | 101 | 20 | 5.300000 | 0 |
| 1 | positive | drivers services exceptional since ever | 61 | 10 | 7.000000 | 0 |
| 2 | positive | rides enjoyable | 30 | 5 | 7.000000 | 0 |
| 3 | neutral | Driver knew | 28 | 6 | 5.000000 | 0 |
| 4 | positive | drivers child friendly patient | 45 | 7 | 6.750000 | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 5917 | neutral | Liked Songs display songs sort recently added | 75 | 15 | 5.571429 | 0 |
| 5918 | negative | Although little annoying free version WAY bett... | 87 | 17 | 6.125000 | 0 |
| 5919 | positive | isnt catalogueits curation Spotify | 72 | 11 | 7.750000 | 0 |
| 5920 | negative | Except fact cant open downloaded albums Im Off... | 84 | 16 | 5.222222 | 0 |
| 5921 | negative | app stinks many interruptions upgrades good do... | 111 | 17 | 6.230769 | 0 |
5819 rows × 6 columns
We re-add placeholder entries at two index positions (578 and 1837) that were removed during cleaning.
new_entry = {'Text': "None", 'Sentiment': 'negative', "Unnamed: 2": np.nan, 'Character_Length': np.nan, 'Word_Length': np.nan}
df3.loc[578] = new_entry
df3.loc[1837] = new_entry
Here we randomly replace a portion of the Sentiment values with None, creating a semi-labeled dataset for a semi-supervised model to train on. Sean wrote this with the help of OpenAI's generative text model.
percent_sampled = 27
sample_df = df3.sample(frac = percent_sampled/100)
sample_df
| | Sentiment | Text | Character_Length | Word_Length | avg_word | hashtags |
|---|---|---|---|---|---|---|
| 4457 | negative | worst appit many time says please check connec... | 59.0 | 10.0 | 5.375000 | 0.0 |
| 3759 | neutral | real time exchange rates currently one option ... | 92.0 | 17.0 | 5.545455 | 0.0 |
| 1060 | positive | Gud useful online classes | 39.0 | 7.0 | 5.500000 | 0.0 |
| 1925 | negative | like stuff please ask language preferences Ind... | 124.0 | 22.0 | 6.090909 | 0.0 |
| 4972 | positive | oBsuzjsiznOAmsy visbss see mm es la PayPal hotel | 57.0 | 11.0 | 5.125000 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 5473 | negative | Theres reason streaming app take GB | 56.0 | 12.0 | 5.000000 | 0.0 |
| 5484 | neutral | Ive tried every option Gmail using phone number | 60.0 | 11.0 | 5.000000 | 0.0 |
| 4883 | negative | never get notifications pagesfrustrating mobil... | 153.0 | 26.0 | 7.000000 | 0.0 |
| 1957 | neutral | used possible choose nickname set rambling let... | 130.0 | 25.0 | 6.000000 | 0.0 |
| 4268 | negative | post posted anything bad | 49.0 | 12.0 | 5.250000 | 0.0 |
1572 rows × 6 columns
df3.loc[df3.Text.isin(sample_df.Text), "Sentiment"] = "None"
df3
| | Sentiment | Text | Character_Length | Word_Length | avg_word | hashtags |
|---|---|---|---|---|---|---|
| 0 | positive | easy app use really quick easy set absolutely ... | 101.0 | 20.0 | 5.300000 | 0.0 |
| 1 | positive | drivers services exceptional since ever | 61.0 | 10.0 | 7.000000 | 0.0 |
| 2 | positive | rides enjoyable | 30.0 | 5.0 | 7.000000 | 0.0 |
| 3 | None | Driver knew | 28.0 | 6.0 | 5.000000 | 0.0 |
| 4 | positive | drivers child friendly patient | 45.0 | 7.0 | 6.750000 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 5919 | positive | isnt catalogueits curation Spotify | 72.0 | 11.0 | 7.750000 | 0.0 |
| 5920 | negative | Except fact cant open downloaded albums Im Off... | 84.0 | 16.0 | 5.222222 | 0.0 |
| 5921 | negative | app stinks many interruptions upgrades good do... | 111.0 | 17.0 | 6.230769 | 0.0 |
| 578 | negative | None | NaN | NaN | NaN | NaN |
| 1837 | negative | None | NaN | NaN | NaN | NaN |
5821 rows × 6 columns
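Because the mask above matches on Text, every row sharing a sampled text gets relabelled, which can push the unlabeled fraction past the intended 27%. Masking by the sample's index labels is exact; a toy sketch:

```python
import pandas as pd

mask_toy = pd.DataFrame({"Text": ["same text", "same text", "other"],
                         "Sentiment": ["positive", "negative", "neutral"]})
sample = mask_toy.sample(n=1, random_state=0)

# .loc with the sample's index labels relabels exactly the sampled rows;
# an isin() mask on Text would also hit any other row sharing that text
mask_toy.loc[sample.index, "Sentiment"] = "None"
print((mask_toy["Sentiment"] == "None").sum())  # → 1
```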
df3.head(25)
| | Sentiment | Text | Character_Length | Word_Length | avg_word | hashtags |
|---|---|---|---|---|---|---|
| 0 | positive | easy app use really quick easy set absolutely ... | 101.0 | 20.0 | 5.300000 | 0.0 |
| 1 | positive | drivers services exceptional since ever | 61.0 | 10.0 | 7.000000 | 0.0 |
| 2 | positive | rides enjoyable | 30.0 | 5.0 | 7.000000 | 0.0 |
| 3 | None | Driver knew | 28.0 | 6.0 | 5.000000 | 0.0 |
| 4 | positive | drivers child friendly patient | 45.0 | 7.0 | 6.750000 | 0.0 |
| 5 | None | Quick easy use drivers quite friendly 😊 | 54.0 | 11.0 | 4.714286 | 0.0 |
| 6 | None | Love Appits easy☮️shows person drive u Name | 88.0 | 17.0 | 5.285714 | 0.0 |
| 7 | None | Best drivers ever | 17.0 | 3.0 | 5.000000 | 0.0 |
| 8 | positive | Good travel app | 24.0 | 5.0 | 4.333333 | 0.0 |
| 9 | positive | Cabs r clean drivers | 26.0 | 6.0 | 4.250000 | 0.0 |
| 10 | positive | Love rides | 14.0 | 3.0 | 4.500000 | 0.0 |
| 11 | positive | Fast affordable efficient means get destinatio... | 72.0 | 12.0 | 7.000000 | 0.0 |
| 12 | positive | Perfect transport | 17.0 | 2.0 | 8.000000 | 0.0 |
| 13 | negative | rider vey wicked use add money | 58.0 | 13.0 | 4.166667 | 0.0 |
| 14 | positive | easiest way find transport safer | 52.0 | 11.0 | 5.600000 | 0.0 |
| 15 | positive | Safe travel | 19.0 | 4.0 | 5.000000 | 0.0 |
| 16 | positive | Always good ride good drivers | 40.0 | 8.0 | 5.000000 | 0.0 |
| 17 | positive | kids loved spacious ride | 33.0 | 6.0 | 5.250000 | 0.0 |
| 18 | positive | enjoyed ride | 17.0 | 4.0 | 5.500000 | 0.0 |
| 19 | None | Clean cars | 11.0 | 2.0 | 4.500000 | 0.0 |
| 20 | positive | Best service best prices | 29.0 | 5.0 | 5.250000 | 0.0 |
| 21 | positive | Fast convenient friendly drivers smile | 52.0 | 8.0 | 6.800000 | 0.0 |
| 22 | positive | never encounted bad experience even drivers pr... | 75.0 | 12.0 | 7.142857 | 0.0 |
| 23 | positive | convenient way moving around ever | 42.0 | 7.0 | 5.800000 | 0.0 |
| 24 | None | Nice rides ☺️ | 14.0 | 3.0 | 3.666667 | 0.0 |
Importing relevant modules
from sklearn.linear_model import LogisticRegression
Split the data into labeled and unlabeled
labeled_data = df3[df3['Sentiment'] != 'None']
unlabeled_data = df3[df3['Sentiment'] == 'None']
unlabeled_data.sample(20)
| | Sentiment | Text | Character_Length | Word_Length | avg_word | hashtags |
|---|---|---|---|---|---|---|
| 3715 | None | Ive experienced simple quick cheap way send mo... | 86.0 | 16.0 | 5.666667 | 0.0 |
| 5700 | None | free love | 36.0 | 10.0 | 4.000000 | 0.0 |
| 1144 | None | ask youve today ÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃ... | 555.0 | 9.0 | 131.250000 | 0.0 |
| 3188 | None | play game internet availible | 40.0 | 7.0 | 6.250000 | 0.0 |
| 3622 | None | support team tencent | 28.0 | 5.0 | 6.000000 | 0.0 |
| 1071 | None | badI cant install phone | 35.0 | 7.0 | 5.000000 | 0.0 |
| 976 | None | Bring seen sign shows messag seen without ente... | 94.0 | 16.0 | 6.111111 | 0.0 |
| 5201 | None | nice meet communicate others see friends | 70.0 | 15.0 | 5.833333 | 0.0 |
| 4172 | None | instant transfers international banks count do... | 131.0 | 19.0 | 7.250000 | 0.0 |
| 1180 | None | new layout fine feature loved gone | 66.0 | 15.0 | 4.833333 | 0.0 |
| 1524 | None | ÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂðÃÂÃÂÃÂ... | 16349.0 | 1.0 | 16349.000000 | 0.0 |
| 5047 | None | lot ads like ads play next song | 67.0 | 16.0 | 3.571429 | 0.0 |
| 1175 | None | love since update everything | 47.0 | 10.0 | 6.250000 | 0.0 |
| 1348 | None | great app tution classes | 37.0 | 8.0 | 5.250000 | 0.0 |
| 1776 | None | one look antalya Alanya | 52.0 | 10.0 | 5.000000 | 0.0 |
| 627 | None | expensive dont know unsubscribe | 55.0 | 10.0 | 7.000000 | 0.0 |
| 3745 | None | weeks transfer arrived promised returne back a... | 133.0 | 26.0 | 6.000000 | 0.0 |
| 4145 | None | access search tab | 24.0 | 6.0 | 5.000000 | 0.0 |
| 102 | None | good app always useful updates ment ease user ... | 130.0 | 22.0 | 5.800000 | 0.0 |
| 1402 | None | Dated nonmaterial design UI lacks basic functi... | 99.0 | 15.0 | 6.363636 | 0.0 |
labeled_data.sample(20)
| Sentiment | Text | Character_Length | Word_Length | avg_word | hashtags | |
|---|---|---|---|---|---|---|
| 2963 | positive | like fact took highlights Facebook | 58.0 | 12.0 | 6.000000 | 0.0 |
| 5651 | positive | love use also phone time working computer | 93.0 | 22.0 | 5.000000 | 0.0 |
| 5320 | positive | never used kind app im download app fun classe... | 96.0 | 18.0 | 5.090909 | 0.0 |
| 1795 | positive | previous version fits | 26.0 | 4.0 | 6.333333 | 0.0 |
| 2397 | neutral | get score coins | 33.0 | 10.0 | 4.333333 | 0.0 |
| 3998 | negative | media section select image option see chat lik... | 98.0 | 20.0 | 5.333333 | 0.0 |
| 5448 | positive | far enjoying bolt rides | 43.0 | 9.0 | 5.000000 | 0.0 |
| 5489 | positive | beauty journey lies drivers | 46.0 | 9.0 | 6.000000 | 0.0 |
| 4759 | positive | pretty good game ðð tho could use less b... | 405.0 | 82.0 | 5.325000 | 0.0 |
| 4775 | negative | Worst game ever game totally force spend money | 58.0 | 11.0 | 4.875000 | 0.0 |
| 411 | positive | worried easy every way Ive tried transfer mone... | 124.0 | 21.0 | 6.272727 | 0.0 |
| 5696 | positive | Surely best flexible time tracking apps | 60.0 | 11.0 | 5.666667 | 0.0 |
| 5658 | positive | point use app | 25.0 | 6.0 | 3.666667 | 0.0 |
| 5880 | negative | new update completely wiped LIKED PLAYLISTdo s... | 98.0 | 15.0 | 6.400000 | 0.0 |
| 3431 | negative | PREMIUM since since problem downloaded songs d... | 155.0 | 30.0 | 6.300000 | 0.0 |
| 1683 | positive | good lesson app designers | 58.0 | 13.0 | 5.500000 | 0.0 |
| 1651 | positive | Excellent content could much better designed i... | 67.0 | 10.0 | 6.857143 | 0.0 |
| 3766 | neutral | steps needed send money wallet | 37.0 | 7.0 | 5.200000 | 0.0 |
| 4909 | positive | good app although Easy fast love gave three star | 71.0 | 17.0 | 4.444444 | 0.0 |
| 1667 | negative | Pubg Mobile Good Game Butt hacker high fast gr... | 114.0 | 20.0 | 5.000000 | 0.0 |
Preprocess and extract features
labeled_data['Sentiment']
0 positive
1 positive
2 positive
4 positive
8 positive
...
5919 positive
5920 negative
5921 negative
578 negative
1837 negative
Name: Sentiment, Length: 4242, dtype: object
# Collect the sentiment labels and vectorize the text for training
y = labeled_data['Sentiment'].tolist()
X = input_vector.transform(labeled_data['Text']).toarray()
# Fit a one-vs-rest logistic regression on the labeled tweets
# (LogisticRegression was already imported above)
model = LogisticRegression(random_state=0, solver='liblinear', multi_class='ovr').fit(X, y)
model_score = model.score(X, y)
model_score
0.8781235266383781
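Note that the score above is computed on the same data the model was trained on, so it can be optimistic. A minimal sketch of a held-out check, using a tiny synthetic stand-in for the tweet data (the texts and labels below are illustrative, not from our dataset):

```python
# Sketch: hold out part of the data to estimate generalization (illustrative data).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

texts = ["love this app", "great experience", "worst update ever",
         "terrible bugs", "pretty good overall", "hate the new layout"] * 10
labels = ["positive", "positive", "negative",
          "negative", "positive", "negative"] * 10

vec = CountVectorizer()
X_all = vec.fit_transform(texts)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, labels, test_size=0.25, random_state=0)

clf = LogisticRegression(random_state=0, solver='liblinear').fit(X_tr, y_tr)
print("train accuracy:", clf.score(X_tr, y_tr))
print("test accuracy:", clf.score(X_te, y_te))
```

On our real dataset the same pattern applies: fit on the training split only, then score on the held-out split.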
Creating a separate copy of df3 so we can compare the original and predicted labels later
df4 = df3.copy()
Predict labels for the unlabeled data
X_unlabeled = input_vector.transform(unlabeled_data['Text']).toarray()
predicted_labels = model.predict(X_unlabeled)
Here we fill in the missing labels with the model's predictions, keeping a copy of the original labels for comparison
df4['original'] = df4['Sentiment']
df4.loc[df4['Sentiment'] == 'None', 'Sentiment'] = predicted_labels
df4['predicted'] = df4['Sentiment']
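As an aside, scikit-learn ships a `SelfTrainingClassifier` that automates this predict-then-pseudo-label pattern. A minimal sketch on hypothetical toy data, where `-1` marks the unlabeled rows:

```python
# Sketch of scikit-learn's built-in self-training wrapper (hypothetical toy data).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

texts = ["love it", "great app", "worst ever", "terrible bugs", "good stuff", "bad layout"]
labels = np.array([1, 1, 0, 0, -1, -1])  # -1 marks the unlabeled rows

X = CountVectorizer().fit_transform(texts)

# The wrapper repeatedly fits the base model and pseudo-labels confident rows
self_train = SelfTrainingClassifier(LogisticRegression(), threshold=0.6)
self_train.fit(X, labels)
print(self_train.predict(X))
```

Unlike our single manual round of pseudo-labelling, the wrapper iterates until no unlabeled row clears the confidence threshold.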
Creating a final datafile with the correct labels
df5 = df4[['Sentiment', 'original', 'predicted']]
df5
| Sentiment | original | predicted | |
|---|---|---|---|
| 0 | positive | positive | positive |
| 1 | positive | positive | positive |
| 2 | positive | positive | positive |
| 3 | negative | None | negative |
| 4 | positive | positive | positive |
| ... | ... | ... | ... |
| 5919 | positive | positive | positive |
| 5920 | negative | negative | negative |
| 5921 | negative | negative | negative |
| 578 | negative | negative | negative |
| 1837 | negative | negative | negative |
5821 rows × 3 columns
Convert the sentiment labels into numeric scores
Function that maps each sentiment label to a numerical value
def ConvertSentiment(score):
    if score == 'positive':
        return 1
    elif score == 'neutral':
        return 0
    else:
        return 2
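The same label-to-score conversion can also be expressed with pandas `Series.map`, which is vectorized and makes the mapping explicit. A minimal sketch on a hypothetical label series:

```python
# Sketch: label-to-score conversion via pandas Series.map (hypothetical series).
import pandas as pd

scores = {'positive': 1, 'neutral': 0, 'negative': 2}
s = pd.Series(['positive', 'neutral', 'negative', 'positive'])
print(s.map(scores).tolist())  # [1, 0, 2, 1]
```

One caveat of `map`: labels missing from the dictionary become NaN rather than falling through to a default, unlike the `else` branch in the function above.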
Applying a score for each sentiment
df5['Sentiment'] = df5['Sentiment'].apply(ConvertSentiment)
df5['predicted'] = df5['predicted'].apply(ConvertSentiment)
df5.head(60)
| Sentiment | original | predicted | |
|---|---|---|---|
| 0 | 1 | positive | 1 |
| 1 | 1 | positive | 1 |
| 2 | 1 | positive | 1 |
| 3 | 2 | None | 2 |
| 4 | 1 | positive | 1 |
| 5 | 1 | None | 1 |
| 6 | 1 | None | 1 |
| 7 | 1 | None | 1 |
| 8 | 1 | positive | 1 |
| 9 | 1 | positive | 1 |
| 10 | 1 | positive | 1 |
| 11 | 1 | positive | 1 |
| 12 | 1 | positive | 1 |
| 13 | 2 | negative | 2 |
| 14 | 1 | positive | 1 |
| 15 | 1 | positive | 1 |
| 16 | 1 | positive | 1 |
| 17 | 1 | positive | 1 |
| 18 | 1 | positive | 1 |
| 19 | 1 | None | 1 |
| 20 | 1 | positive | 1 |
| 21 | 1 | positive | 1 |
| 22 | 1 | positive | 1 |
| 23 | 1 | positive | 1 |
| 24 | 1 | None | 1 |
| 25 | 1 | positive | 1 |
| 26 | 1 | None | 1 |
| 27 | 1 | None | 1 |
| 28 | 1 | positive | 1 |
| 29 | 1 | positive | 1 |
| 30 | 1 | positive | 1 |
| 31 | 1 | None | 1 |
| 32 | 1 | None | 1 |
| 33 | 1 | positive | 1 |
| 34 | 1 | positive | 1 |
| 35 | 1 | positive | 1 |
| 36 | 1 | None | 1 |
| 37 | 1 | positive | 1 |
| 38 | 2 | negative | 2 |
| 39 | 2 | negative | 2 |
| 40 | 1 | None | 1 |
| 41 | 1 | positive | 1 |
| 42 | 2 | negative | 2 |
| 43 | 2 | negative | 2 |
| 44 | 2 | negative | 2 |
| 45 | 0 | None | 0 |
| 46 | 2 | None | 2 |
| 47 | 2 | negative | 2 |
| 48 | 2 | None | 2 |
| 49 | 2 | None | 2 |
| 50 | 2 | negative | 2 |
| 51 | 2 | negative | 2 |
| 52 | 2 | negative | 2 |
| 53 | 2 | negative | 2 |
| 54 | 2 | negative | 2 |
| 55 | 2 | negative | 2 |
| 56 | 2 | negative | 2 |
| 57 | 2 | negative | 2 |
| 58 | 2 | None | 2 |
| 59 | 2 | negative | 2 |
Load the metric and calculate the error for the model. Note that the error here is zero by construction: the 'predicted' column was copied from the filled 'Sentiment' column above, so the two columns are identical.
from sklearn.metrics import mean_squared_error
predicted_values = df5['predicted']
actual_values = df5['Sentiment']
mse = mean_squared_error(actual_values, predicted_values)
rmse = mse ** 0.5
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
Mean Squared Error (MSE): 0.0
Root Mean Squared Error (RMSE): 0.0
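Since sentiment is a categorical label, classification metrics such as accuracy and a confusion matrix are arguably a more natural fit than MSE. A small sketch on hypothetical score arrays:

```python
# Sketch: classification metrics on hypothetical numeric sentiment scores.
from sklearn.metrics import accuracy_score, confusion_matrix

actual    = [1, 1, 2, 0, 2, 1]
predicted = [1, 2, 2, 0, 2, 1]
print("accuracy:", accuracy_score(actual, predicted))   # 5 of 6 correct
print(confusion_matrix(actual, predicted))
```

The confusion matrix also shows which sentiment classes get confused with each other, which MSE cannot.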
We save the model. We will use this saved model to build our web application
import joblib
joblib.dump(input_vector, 'vector.pkl')
['vector.pkl']
joblib.dump(model, 'sentiment_analysis.pkl')
['sentiment_analysis.pkl']
Let's quickly test our model to see if it predicts based on our analysis
test_model = joblib.load('sentiment_analysis.pkl')
result = test_model.score(X, y)
print(result)
0.8781235266383781
Hence, we can conclude that Logistic Regression is the best model to use for our dataset. The Random Forest model was a close second; in the end, we decided to go with Logistic Regression due to its higher accuracy.
Mehreen Saeed, Modeling Pipeline Optimization With scikit-learn. URL: https://machinelearningmastery.com/modeling-pipeline-optimization-with-scikit-learn/
Pratik Parmar, Enable Plotly in a cell in Colab. URL: https://stackoverflow.com/a/54771665
Build a function to search a list of dictionaries. URL: https://stackoverflow.com/questions/8653516/search-a-list-of-dictionaries-in-python
Gilbert Tanner, Building a web app with Streamlit and deploying with Heroku. URL: https://gilberttanner.com/blog/deploying-your-streamlit-dashboard-with-heroku/
M.A. Al-Barrak, Muna S. Al-Razgan, Predicting students' performance through classification. Journal of Theoretical and Applied Information Technology 75(2):167-175. URL: https://www.researchgate.net/publication/282381796_Predicting_students'_performance_through_classification_A_case_study
This file was generated using nbconvert; additional information on how to prepare articles for submission is here.
The article itself is an executable Colab Markdown file that can be downloaded from GitHub with all the necessary artifacts.
Link to the web application - Sentiment Analysis
Kunwar Rajdeep Singh - York University School of Continuing Studies